Setting up Data and Functions

We start by reading the data from the wineQualityReds CSV file and storing it in the variable wineQualityData.
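The loading step might look like the sketch below. The file name and the variable `wineQualityData` come from the text; the helper `load_wine_data` and the file-existence check are illustrative assumptions, not the code used to produce this report.

```r
# Hypothetical helper: read the red-wine CSV into a data frame.
load_wine_data <- function(path = "wineQualityReds.csv") {
  if (!file.exists(path)) {
    stop("CSV file not found: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Only attempt the load when the file is present in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wineQualityData <- load_wine_data()
  print(summary(wineQualityData))
}
```

Failing early on a missing file keeps later plotting errors from hiding the real cause.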

Data Summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Data: We have 1599 rows of data, where X is the unique identifier for each wine. There are 11 properties that may influence the quality of the wine. Quality is an ordered variable whose values range from 3 to 8 for our given set of wines. The mean wine quality is about 5.6.
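Because quality is an ordered variable, it can help to keep an ordered-factor copy next to the numeric scores. The sketch below uses a made-up vector of scores; with the real data the input would be `wineQualityData$quality`, and the names `quality_ord` and `quality_counts` are hypothetical.

```r
# Toy quality scores (made up); the real column is wineQualityData$quality.
quality <- c(5, 6, 5, 7, 3, 8, 6, 5)

# Treat quality as an ordered categorical variable over the observed 3-8 range.
quality_ord <- factor(quality, levels = 3:8, ordered = TRUE)

# Counts per score, including scores that do not occur in the sample.
quality_counts <- table(quality_ord)
print(quality_counts)

# The numeric mean still uses the raw scores.
print(mean(quality))
```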

Defining Plotting Function

We start with a univariate analysis, looking at the distribution of each variable before relating the attributes to the quality of a wine.
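The plotting function itself is not shown in this report. A minimal sketch of such a helper, assuming base R graphics, is given below; the name `plot_univariate`, its arguments, and the synthetic input are all hypothetical.

```r
# Hypothetical helper: histogram of one numeric column with a median marker.
# `limits` optionally restricts the plotted range (e.g. to drop outliers).
plot_univariate <- function(values, label, limits = NULL, breaks = 30) {
  if (!is.null(limits)) {
    values <- values[values >= limits[1] & values <= limits[2]]
  }
  h <- hist(values, breaks = breaks, main = paste("Distribution of", label),
            xlab = label, col = "grey80", border = "white")
  abline(v = median(values), lty = 2)  # dashed line at the median
  invisible(h)                         # return the bin counts for inspection
}

# Usage on synthetic values (not the real data).
set.seed(1)
fake_acidity <- rlnorm(500, meanlog = log(8), sdlog = 0.2)
h <- plot_univariate(fake_acidity, "fixed.acidity", limits = c(5, 14))
```

Returning the histogram object invisibly lets the same call serve both for plotting and for checking how many values survive the range limit.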

Univariate Data Analysis

Distribution of Quality

The majority of the wines have a quality of 5 or 6, with very few wines being really good or bad (8 or 3 respectively).

Distribution of Each Property

We analyze each property individually.

1. Fixed Acidity

From the above plot, it appears that the majority of the values for fixed acidity lie in the range 5 to 14, so we limit our fixed acidity values to this range.

The median for fixed acidity is around 8 and the distribution is positively skewed. A large number of values lie in the range of 7 to 9.

2. Volatile Acidity

The majority of values for volatile acidity lie in the range of 0.2 to 1.

The median is around 0.52 and this distribution is also positively skewed.

3. Citric Acid

A lot of citric acid values appear to be zero. The data available for citric acid might be incomplete.
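One way to quantify this is to count exact zeros separately from missing values, since the two point to different problems. The sketch below uses a made-up vector; with the real data the input would be `wineQualityData$citric.acid`.

```r
# Toy values (made up); the real column is wineQualityData$citric.acid.
citric <- c(0, 0, 0.04, 0.26, 0, 0.49, 0.31, 0, 0.56, 0.1)

n_zero <- sum(citric == 0)        # number of exact zeros
share_zero <- mean(citric == 0)   # fraction of exact zeros
n_missing <- sum(is.na(citric))   # true NAs, a different issue from zeros

print(c(zeros = n_zero, share = share_zero, missing = n_missing))
```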

4. Residual Sugar

The distribution of residual sugar is heavily skewed, with a long tail to the right, and most of the data lies in the range 1 to 5.

Even after filtering some outliers, the data is still positively skewed with a median around 2.25.
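The effect of this kind of range filter can be checked by comparing the median and mean before and after trimming. The sketch below runs on synthetic right-skewed values, not the real column, and uses the 1 to 5 range from the text.

```r
# Synthetic right-skewed values standing in for residual.sugar.
set.seed(2)
sugar <- c(rlnorm(300, meanlog = log(2.2), sdlog = 0.3), 9, 11, 15.5)

# Keep only the 1-5 range mentioned in the text.
sugar_trimmed <- sugar[sugar >= 1 & sugar <= 5]

# Compare location measures with and without the tail.
print(c(median_all = median(sugar), median_trimmed = median(sugar_trimmed)))
print(c(mean_all = mean(sugar), mean_trimmed = mean(sugar_trimmed)))
```

The median usually moves much less than the mean when a long tail is removed, which is why it is the statistic quoted for the skewed properties.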

5. Chlorides

The data for chlorides is similar to that of residual sugar. We consider the data that lies between 0.04 and 0.14.

The data for this range appears to be normally distributed with a few outliers. The median is around 0.08.

6. Free Sulfur Dioxide

Most of the values for free sulfur dioxide lie in the range of 0 to 35.

For this property we see a high peak around 7-8, while the median is around 13. The gap comes from the long tail of values in the high range, which gives the distribution a positive skew.
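Skew can also be expressed as a number rather than read off the plot. The sketch below computes the moment-based sample skewness on made-up values; `sample_skewness` is a hypothetical helper written for illustration, not a base R function.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy right-skewed values (made up), loosely shaped like free.sulfur.dioxide.
so2 <- c(5, 6, 7, 7, 8, 8, 10, 13, 15, 18, 21, 26, 34, 48, 72)

print(sample_skewness(so2))
print(c(median = median(so2), mean = mean(so2)))
```

A mean well above the median, together with a positive skewness value, matches the long right tail described above.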

7. Total Sulfur Dioxide

Most of the values are in the range 0 to 100. Since free sulfur dioxide is part of total sulfur dioxide, we can expect to see a similar positively skewed graph for total sulfur dioxide.

Our expectation was correct in this case: we see a positively skewed graph with a high peak around 25, whereas the median is around 36. The similar shapes suggest that the values for total sulfur dioxide track those of free sulfur dioxide to some extent.

8. Density

The data for density appears to be normally distributed.

Both the median and the mean are around 0.997, which is consistent with a symmetric, approximately normal distribution.

9. pH

The data for pH level also appears to be normally distributed.

Both the median and the mean are around 3.3, which is consistent with a symmetric, approximately normal distribution.
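A mean close to the median indicates symmetry rather than normality as such, so a direct check can be useful. The sketch below applies a Shapiro-Wilk test and a Q-Q plot to synthetic values centred near the reported pH mean. It is an additional check, not part of the original analysis.

```r
set.seed(3)
# Synthetic values centred near the reported pH mean (3.31); not the real data.
ph <- rnorm(400, mean = 3.31, sd = 0.15)

# Mean close to median suggests symmetry, but not normality by itself.
print(c(mean = mean(ph), median = median(ph)))

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(ph)
print(sw$p.value)

# Q-Q plot: points near the line support approximate normality.
qqnorm(ph)
qqline(ph)
```

`shapiro.test` accepts between 3 and 5000 observations, so it would also run on the 1599 rows described in the data summary.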

10. Sulphates

In this case we put our limits at 0.3 and 1.

11. Alcohol

Most alcohol percentages are around 9% to 11%, with a few values going up to 13%.

This graph is positively skewed with a median around 10.2, which is expected because most of the wines have their alcohol percentage in the 9% to 11% range.

Observations:

In the univariate data analysis we observed that many values for citric acid are zero, which indicates that the data might be incomplete. Many properties, like fixed acidity, volatile acidity and alcohol content, tend to have positive skews, which might be useful in the later part of our analysis. Another important point to note here is that total sulfur dioxide and free sulfur dioxide are somewhat correlated with each other.

Bivariate Data Analysis

Property vs Quality

Density vs Quality

From the above plots we can see that wines with higher quality have a lower median density. There is a negative correlation between the quality and the density of a wine.
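The per-quality medians behind such a comparison can be computed directly. The sketch below uses a small made-up data frame whose column names match the dataset; the values are illustrative only.

```r
# Toy data frame (made up) with the two columns used in this comparison.
toy <- data.frame(
  quality = c(3, 3, 5, 5, 5, 6, 6, 7, 7, 8),
  density = c(0.9985, 0.9979, 0.9972, 0.9970, 0.9968,
              0.9966, 0.9964, 0.9960, 0.9958, 0.9951)
)

# Median density within each quality score.
median_by_quality <- tapply(toy$density, toy$quality, median)
print(median_by_quality)

# Box plot of density grouped by quality.
boxplot(density ~ quality, data = toy, xlab = "quality", ylab = "density")
```

The same `tapply` call works for any other property by swapping the first argument.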

Alcohol vs Quality

Higher quality wines in the dataset have higher alcohol content on average as compared to the lower quality ones. There is a positive correlation between alcohol and quality.

pH Level vs Quality

Wines are generally acidic in nature, which explains why almost all pH levels are below 7 (which is neutral). We can observe that most wines have a pH level in the range 3 to 4, and there is a slight negative correlation with quality.

Residual Sugar vs Quality

There are many outliers for the residual sugar property. Let’s filter out the outliers and plot the values.

The residual sugar content is almost the same for all qualities of wine.

Sulphates vs Quality

It looks like sulphates also has a lot of outliers; however, we can observe a positive correlation from the boxplot. Let’s have a closer look.

Yes, our observation was correct: better quality wines have higher sulphates content.

Correlation

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632
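Coefficients like those in the table could be computed along the lines of the sketch below, which takes the Pearson correlation of each property with quality and log10-transforms selected columns first. The helper `cor_with_quality` and the toy data are hypothetical. Which columns the original analysis transformed is inferred only from the `log10.` prefixes in the output.

```r
# Hypothetical helper: Pearson correlation of each property with quality.
# Columns listed in `log_cols` are log10-transformed first.
cor_with_quality <- function(df, log_cols = character(0)) {
  props <- setdiff(names(df), c("X", "quality"))
  sapply(props, function(p) {
    x <- if (p %in% log_cols) log10(df[[p]]) else df[[p]]
    cor(x, df$quality)
  })
}

# Toy data (made up), just to show the call shape.
toy <- data.frame(
  X = 1:8,
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.9, 11.4, 12.0, 12.8),
  volatile.acidity = c(0.88, 0.70, 0.66, 0.58, 0.52, 0.40, 0.36, 0.30),
  sulphates = c(0.48, 0.52, 0.55, 0.60, 0.66, 0.70, 0.78, 0.85),
  quality = c(4, 5, 5, 5, 6, 6, 7, 8)
)
cors <- cor_with_quality(toy, log_cols = "sulphates")
print(round(cors, 3))
```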

From the above values we can say that alcohol, volatile acidity and sulphates have the strongest correlations with quality. We already observed that alcohol and sulphates have a positive correlation with quality. Let’s have a look at volatile acidity vs quality.

Volatile Acidity vs Quality

Volatile acidity has a moderate negative correlation with wine quality (about -0.39), the strongest negative correlation among the properties.

Multivariate Analysis

In the previous section we observed which properties are most strongly related to the quality of wines. Let’s have a look at how combinations of these factors affect the quality.

The above graph shows that wines with higher alcohol content and lower volatile acidity tend to have a higher quality rating.
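A three-variable view of this kind can be sketched as a scatter plot coloured by quality, with group means as a numeric cross-check. The example below uses base R graphics and made-up data. The plotting approach is an assumption, not a reconstruction of the graph in this report.

```r
# Toy data (made up) with the three columns discussed above.
toy <- data.frame(
  alcohol = c(9.2, 9.4, 9.8, 10.0, 10.6, 11.2, 11.8, 12.5, 12.9),
  volatile.acidity = c(0.90, 0.78, 0.70, 0.62, 0.55, 0.45, 0.40, 0.33, 0.30),
  quality = c(3, 4, 5, 5, 6, 6, 7, 7, 8)
)

# One shade per quality score, from light (low) to dark (high).
scores <- sort(unique(toy$quality))
palette_q <- gray(seq(0.8, 0.1, length.out = length(scores)))
point_col <- palette_q[match(toy$quality, scores)]

plot(toy$alcohol, toy$volatile.acidity, col = point_col, pch = 19,
     xlab = "alcohol", ylab = "volatile.acidity")
legend("topright", legend = scores, col = palette_q, pch = 19, title = "quality")

# Group means give a numeric view of the same pattern.
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                         data = toy, FUN = mean)
print(group_means)
```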

Good quality wines tend to have a lower sulphates level. Based on the past two observations, we can expect good quality wines to be prevalent in the bottom left of a graph of sulphates against volatile acidity. Let’s have a look.

This graph stays true to our expectation. A lot of good quality wines lie in the bottom left of the graph.

Observations:

Final Plots and Summary

Plot 1: Volatile Acidity vs Quality

This graph shows a clear negative correlation between wine quality and volatile acidity: the better the wine quality, the lower its volatile acidity.

Plot 2: Alcohol vs Quality

We observed that alcohol content has a positive correlation with quality (about 0.48), the strongest among the properties. The following graph depicts that.

Plot 3: Alcohol vs Volatile Acidity vs Quality

The above plots help us understand that volatile acidity and alcohol are the major properties that affect the quality of a wine. Other factors like density, pH level and sulphates also affect wine quality to some extent.

Reflections:

We were able to identify some properties that might be affecting the quality of a wine. However, our dataset only had 1599 different wines, all produced in a certain region of Portugal, which is far fewer than the large number of wines available in the market. Therefore our analysis need not necessarily apply to wines made in other countries. We also need to keep in mind that the ratings were given by a fixed group of individuals and, since taste differs from person to person, the ratings provided by this group need not necessarily apply to the entire populace.